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Abstract 

As in many other areas of science, systems biology makes extensive 
use of statistical association and significance estimates in contingency 
tables, a type of categorical data analysis known in this field as enrich- 
ment (also over-representation or enhancement) analysis. In spite of 
efforts to create probabilistic annotations, especially in the Gene On- 
tology context, or to deal with uncertainty in high throughput-based 
datasets, current enrichment methods largely ignore this probabilis- 
tic information since they are mainly based on variants of the Fisher 
Exact Test. We developed an open-source R package to deal with 
probabilistic categorical data analysis, ProbCD, that does not require 
a static contingency table. The contingency table for the enrichment 
problem is built using the expectation of a Bernoulli Scheme stochas- 
tic process given the categorization probabilities. An on-line interface 
was created to allow usage by non-programmers and is available at: 
http://xerad.systemsbiology.net/ProbCD/. We present an analysis 
framework and software tools to address the issue of uncertainty in 
categorical data analysis. In particular, concerning the enrichment 
analysis, ProbCD can accommodate: (i) the stochastic nature of the 
high-throughput experimental techniques and (ii) probabilistic gene 
annotation. 
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Background 



The system-level approach to data analysis known as enrichment analysis 
(also known as over-representation or enhancement analysis) is now com- 
monplace. Moreover, the number of available software tools to perform such 
analysis is large (see flUl 122] for comprehensive reviews). The preferred way 
to formalize the enrichment problem is by means of a contingency table, often 
2x2. 

The mathematical problem is conceptually generic, being applied to di- 
verse types of data, such as genomics, transcriptomis or proteomics datasets; 
diverse types of analysis, including multiple and/or ordered outcomes; and 
diverse types of gene classification schemes, such as Gene Ontology (GO), 
KEGG or organism-specific ones. For a given ontology term t defining the set 
of genes Gt and its complementary set G^, the general enrichment analysis 
contingency table is: 





Gt 




outcomei 




Xl^2 


outcome2 




-^2,2 


outcome k 




Xk,2 



Besides measuring the statistical significance of the null hypothesis that 
the rows and columns are independent, as yielded by Fisher's Exact Test 
[13] and Fisher-like methods [10l[22], it is also possible to measure statistical 
association between a table's rows and columns [12] (a detailed discussion on 
significance vs. association in the enrichment problem context can be found 
in [25]). 

Most of the attention in the enrichment analysis problem has focused on 
issues such as the search for the best multiple-test correction or the imple- 
mentation of better user-friendly software interfaces to facilitate biologist's 
exploratory work [in]. However, one of the limitations that the available 
approaches still share is that they assume, explicitly or implicitly, that one 
is able to construct the contingency table exactly, without uncertainty in 
populating its cells. 

Recently, the computational biology community has been witnessing an 



2 



increasing interest in probabilistic approaches to gene annotation, particu- 
larly in the Gene Ontology (GO) context, as a realization of the limitations 
imposed by the traditional deterministic and context-independent gene an- 
notation schemes |I71ll8l|12l[2Tl[IIll9l|26l|16]. These efforts are motivated 
by: the necessity to assess the error propagation in automatic gene anno- 
tation [ISl [IS] ; desire to include different types of evidence sources such as 
protein-protein interaction [171 E] or phylogenomics [121 E] annotation 
extrapolation from model organisms to others [261 |2T]. Meanwhile, the prob- 
abilistic nature of data obtained by high-throughput measurement techniques 
is well recognized and a number of attempts to model it were proposed over 
the past decade in various experimental contexts [2Zl EH]- However, these 
efforts are not integrally taken into account when usual enrichment analysis 
is performed. 

We describe a computational solution that is able to deal with the uncer- 
tainty introduced in enrichment analysis due to: (i) the stochastic nature of 
the results obtained with such high-throughput experimental techniques or 
(ii) probabilistic gene annotation. 

Implementation 

ProbCD is an open-source software designed to perform probabilistic categor- 
ical data analysis. ProbCD is written in R [6] with a level of modularity that 
makes it suitable to be incorporated by existing development efforts of inte- 
grative tools [21]. To facilitate the usage by researchers with no knowledge 
of R, we implemented a user-friendly web-based interface for the software, 
which is not limited to any particular organism. The on-line interface and 
the source-code are available on the project's website [1]. 

The idea behind ProbCD 's implementation is to formally represent the 
intuitive process of building a contingency table in a probabilistic manner. 
Informally speaking, each element to be placed in the contingency table is 
not considered to be indivisible, but instead is "shared" , according to prob- 
abilistic rules, among the contingency table's cells in a manner that is con- 
ceptually similar to fuzzy membership. The theoretical and computational 
implementation aspects are described in detail below. 
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Without loss of generality, the following descriptions are applied consid- 
ering one particular ontology term t that is associated with a set of genes, 
named simply as Gt- It should be noted that Gt is not restricted to the Gene 
Ontology categorization and can be any kind of classification or annotation. 

The vector q contains a probabihstic annotation for all g of the organism's 

genes: q.j = F{genej G Gt) for j G {1, ■ ■ ■ ,g}. This probabilistic annotation 
is assumed to be given, typically obtained from some analysis process. The 
deterministic scenario corresponds simply to F{genej G Gt) G {0, 1}, and 
hence is a special case. 

The matrix P contains a probabilistic description for all k possible out- 
comes of the property being studied. Therefore, P is a k x g matrix with 
elements Pij = F{genej G outcomei) for j G {1, ■ ■ ■ ,g} and i G {1, ■ ■ ■ ,k}. 
This probabilistic description of the data uncertainty is assumed to be given. 

To motivate the general probabilistic model, it is useful to examine an 
arbitrary 2x2 example in the deterministic scenario: 





G 


G" 


H 










X2,2 



where all x's are the counts of a regular contingency table over the gene 
sets G and H. In its matrix representation: 

(2^1,1 Xi^2 \ _ f '^j'^{genejeH}'^{genejeG} S j l{genej e/f} l{genej eG":} \ 
2^2,1 X2,2 J \ ^j'^{gene^eH''}'i-{genejeG} J2j^{9enejeH'=}'i-{genejeG'=} J 

where 1{} is the indicator function. 

Inspired by this representation, it is easy to see that the "hard" indica- 
tor functions may be substituted by Bernoulli random variables in order to 
account for the categorization uncertainty. Since all sets are finite, the indi- 
cator functions can be represented as vectors in {0, 1}^ and the sums over all 
genes as dot products. In a generic scenario, with given non-deterministic 
P and q, the contingency table represented by X|P,q is a random matrix 
that is difficult to describe in closed form. It is also not compatible with 
the statistical formalism supporting Fisher's Exact Test or other well-known 
Fisher-like approaches, as these are not applicable to random tables. 
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The contingency table is defined in terms of Bernoulli Schemes [2] which 
is the generalization of the Bernoulli Process to more than two possible out- 
comes. The notation Z ~ Be(pi, ■ ■ ■ represents the distribution: 



2 = < 



' (1,0,0,--- ,0) 
(0,1,0,--- ,0) 
(0,0,1,--- ,0) 



with probability pi ; 
with probability 
with probability ps; 



(0,0,0,--- , 

^ Pi H hp, 



'n 



1) 



1 



with probability 



The random variable X is a matrix representation of a /c x 2 contingency 



where ■ is the usual dot-product, = (Ajj,--- ,Ai^g) is a row-vector of 
a 2 X (7 binary matrix A such that {^A\^j^ A2,j)\qj ~ Be(gj, 1 — q^) and dj = 

[Di^u ■ ■ ■ , -Dj.g) is a row-vector of a kxg binary matrix D such that {Dij, ■ ■ ■ , Dkj) \ {Pij, ■ ■ ■ , Pk,j) ^ 



It is very easy to extend this framework for completely generic kxm tables 
(m > 2), but this would be outside the scope of the ontology enrichment 
problem. 

To measure statistical association between rows and columns in an or- 
dered contingency tables, as is analogously made when correlations are calcu- 
lated for non-categorical data, Goodman-Kruskal's gamma 7(X) can be used 
[T5|, [HI [25] . ProbCD calculates the statistical association accounting for the 
stochastic nature of the table's categorization reporting 7 = 7(E[X|P,g]), 
where E is the expectation operator. This is a particular association mea- 
surement that can be easily changed for other user-implemented options. 



The dichotomous case, which is the simplest one, gives a more intuitive 
illustration on how the association is calculated in practice for the particular 
implementation: E[Xi_i|P,g] = Ei^i = Pi^iQi + ■■■ + Pi,gqg, E[X2,i|P,g] = 
^2,1 = (l-Pi,i)gi + --- + (l-PiJg„E[Xi,2|P,q] = ^1,2 = Pi,i(l-gi) + --- + 



table: 




Be(Pi,„--- ,Pk,j). 
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Pi,,(l - q,), E[X2,2|P, q] = £^2,2 = (1 - Pi,i)(l - gi) + • • • + (1 - - q,) 

and 7 = (£'i,2-E'2,2 — -E'i,2-E'2,i)/(£'i,2-E'2,2 + -E'i,2-E'2,i)- 

To measure the statistical significance of tlie estimated association, ProbCD 
uses a randomization approach. The null distribution for the Goodman- 
Kruskal's gamma, 7*, is proposed to be estimated from several permuta- 
tion rounds. In each round a gene j receives randomly its probabilities 
{Pij^ ■ ■ ■ jPkj) froiii one of the g possible columns of P and an association 
value is calculated. The significance of the statistical association between 
rows and columns in the contingency table is calculated as p = P(7* > 7). 
A term t is significantly over-represented (or equivalently, the gene list is 
enriched for t) depending on user-defined thresholds for significance and/or 
association. 

Results 

The following examples illustrate the potential utility of considering prob- 
abilistic annotations and/or data uncertainty assessment in the enrichment 
analysis using ProbCD on artificial datasets and a published yeast dataset. 

The point of the following illustration is to show that even ontology terms 
annotated with modest probabilities can be considered to be over-represented 
if the list of genes obtained behave in a supportive pattern. Consider a hypo- 
thetical organism with 100 genes annotated in several GO terms, as described 
in the Additional Files. The genes genei to gene2o are detcrministically an- 
notated to the ontology term t = a. In other words, assume that it is well 
known that these 20 genes have some given functionality a. The experiment, 
for example from a hypothetical proteomics dataset, yielded a deterministic 
list of differentially expressed (DE) genes ranging from genei to gene\Q. The 
contingency table for this problem is, therefore: 





Ga 


Gl 


DE 


10 







10 


80 



In this case, the gene list is clearly enriched for a within any meaningful 
significance cutoff. Consider now a second ontology term h obtained from 
a probability-based source with F^genCi e Gh) = 40%, i e {1, • " " )20}. A 
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probability of only 40% generally would not be sufficient evidence to warrant 
the inclusion of those 20 genes in Gb considering a usual deterministic frame- 
work and, therefore, would not be analyzed by deterministic-based methods, 
such as the Fisher's Exact Test. However, ProbCD is able to incorporate 
this information and yields: 7 = 0.87 and p < 10~^ in 10000 permutation 
rounds, a significant enrichment for b. One can easily imagine, for example, 
genes that have a main function a but also have a different function b in, say, 
40% of documented conditions. 

The point of the following illustration is to show that the incorporation 
of probabilistic annotation information does not always translate to addition 
of terms into the enrichment result, as in the example above, but it can also 
mean the exclusion of non-relevant terms. Consider a hypothetical organism 
with 1100 genes. Let the genes genei to geneioo be grouped together in a 
cluster H after some genomic sequence analysis. Let the term a be annotated 
deterministically (Additional Files) yielding the contingency table: 





Ga 


Gl 


H 


100 







100 


900 



In this situation, H is clearly enriched for a within any meaningful sig- 
nificance cutoff. Let now the same annotation incorporate some evidence 
levels by defining: F{genei G Ga) = 99% for z G {1, ■ ■ ■ , 10} and F{genei G 
Ga) = 1% for i G {11, ■ ■ ■ , 100}. Intuitively, this means that only 10 out of 
100 genes clustered in H are, in fact, confidently annotated with the ontol- 
ogy term a. The incorporation of this information results in non-significant 
enrichment of H for a since: 7 = 0.0425 and p = 0.42 in 1000 permutation 
rounds. Therefore, it can be useful to incorporate uncertainty information 
into the enrichment analysis to also down-rank potentially spurious enrich- 
ment results. 

The purpose of the following illustration is to show the impact of consid- 
ering the uncertainty in lists of genes, rather than in the annotations, on the 
enrichment analysis. In this example, the aim is to find which GO terms, an- 
notating the yeast Saccharomyces cerevisiae, are statistically associated with 
periodic expression levels, measured by microarray technology [7j. Andersson 
and colleagues [7j devised a Bayesian model that produces the probability 
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that a gene is periodically expressed during the cell-cycle. This simple pre- 
sentation is sufficient for our objectives in this work, but the interested reader 
can find more details (e.g. the definition of "periodic", etc.) in the original 
work. In this example, the annotation is considered to be deterministic and 
it was downloaded from the GO project page (March 2007) [1]. 

To perform the usual enrichment analysis one needs to define a probability 
cutoff value in order to spht the gene list in two: the periodic genes and 
the non-periodic genes. Consider initially the reasonable cutoff F{genei is 
periodic) > 70% and focus on a single GO term GO:0007090 (regulation of 
S phase of mitotic cell cycle), defined as "a cell cycle process that modulates 
the frequency, rate or extent of the progression through the S phase of mitotic 
cell cycle". Although this GO term is clearly associated with periodic gene 
expression, performing a usual enrichment analysis results in the conclusion 
that the periodic genes are not significantly enriched for GO:0007090 within 
usual significance cutoffs (p- value = 0.065). 

Suspecting that this non-intuitive result could be due to the probability 
threshold chosen to select periodic genes, illustrated in the Figure [T| one could 
repeat the same analysis above building the contingency table considering 
the cutoffs Figenci is periodic) > 50%, 95%, 99% or 99.99%. The result of 
this repeated analysis is also non-intuitive since the p-values are: 0.12, 1.0, 
1.0 and 1.0 for 50%, 95%, 99% and 99.99% cutoffs, respectively, meaning 
that increasing the stringency to define a gene as periodic only decreases the 
significance of the enrichment for GO: 0007090. 
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Figure 1: Probability of being periodic. The blue curve represents the 
probability of a gene being periodic (Pr) according to the model of [7]. 
The genes are sorted by probability values (rank) on the horizontal axis 
to facilitate the visualization. The red curve is the deterministic approx- 
imation using a 70% probability cutoff to consider a gene as periodic: 
F{genei is periodic) > 0.70 =^ F{genei is periodic) = 1 and F{genei is 
periodic) < 0.70 ^ F{genei is periodic) = 0. This approximation labels 15% 
of the genes as periodic. 



Using ProbCD one can consider the actual probability of being periodic 
(blue curve in Figure [T]) in the enrichment analysis instead of using the de- 
terministic approximation (red curve in Figure [T|. This result in a relatively 
high statistical association between periodicity and the term "regulation of 
S phase of mitotic cell cycle" (7 = 0.78) also with high significance [p = 
0.009 in 1000 simulation rounds). Judging subjectively by the definition of 
GO:0007090, ProbCD returned a meaningful result. 

Other similar cases can be easily identified. For example, the GO term 
GO:0000083 (Gl/S-specific transcription in mitotic cell cycle) exhibits er- 
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ratic behavior depending on the chosen cutoff for the probabihty of being 
periodic: p-value of 0.15, 0.10, 0.01, 0.096 and 1.0 for 50%, 75%, 95%, 99% 
and 99.99% cutoffs, respectively. The probabihty stringency used to build 
the contingency table and the subsequent significance test are not necessarily 
correlated. ProbCD yielded a significant [p = 0.006) moderate association 
(7 = 0.48) for GO:0000083. Other examples include GO:0045787 (positive 
regulation of progression through cell cycle), defined as "any process that 
activates or increases the frequency, rate or extent of progression through the 
cell cycle", which would be called significant using the regular enrichment 
method only if the right probabihty cutoff F^genCi is periodic) > 95% is 
guessed initially: p-value of 0.047, 0.024, 0.0058, 0.086 and 0.024 for 50%, 
75%, 95%, 99% and 99.99%, respectively. 

The above analysis process is repeated for all GO terms, with the results 
available as Additional Files and summarized in Figure |2} 
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Figure 2: Venn diagram of over-represented terms. The Venn diagram shows 
the number of GO terms considered significantly over-represented (p- value < 
0.01) by the Fisher Exact Test using four different probability cutoffs F{genei 
is periodic) > A, B, C or D ^ periodic: A = 0.70, B = 0.95, C = 0.99 and 
D = 0.9999. 

This figure suggests that there is a large variability in the possible final 
outcome of an enrichment analysis depending on the probability cutoff used 
to build the associated contingency table. This variability is avoided by 
ProbCD because it directly takes into account the uncertainty in the data 
instead of introducing a discretization step (Figure [T]). 

Figure [3] shows that ProbCD considers more terms (vertical axis in Figure 
|3| containing the word "cell cycle" , likely associated to periodically expressed 
genes, as significant if compared to the usual enrichment analysis in a wide 
range of significance values {p in Figure |3]). Although this is not a proof, 
since one cannot be certain about which "cell cycle" -marked terms should be 
enriched, we argue that this is a reasonable indication that one can, in fact. 
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avoid the discretization step when building the enrichment problem using 
ProbCD and obtain meaningful results. 
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Figure 3: Fraction of "cell-cycle" GO terms selected as a function of the 
p-value. The curves show the fraction of GO terms containing the word 
"cell-cycle" in their definition that are considered significant as a function of 
the significance cutoff (p). The red curve is obtained with ProbCD and all 
others are obtained with one of the probability cutoffs: 50%, 70%, 95%, 99% 
or 99.99%. 



Discussion and Conclusions 

The usual enrichment analysis is a particular case in this probabilistic frame- 
work and can be obtained by ProbCD ignoring the difference between evi- 
dence sources in gene annotation and defining fixed gene lists, which would 
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correspond to the deterministic setting: = F{genej G Gt) = 1 or and 
Pij = F{genej G outcomei) = 1 or 0. 

Even if a probabilistic annotation is not readily available for a given or- 
ganism, it could be interesting to perform enrichment analysis taking into 
account some form of weighting on available annotations according to their 
reliability. For a concrete example, the GO Consortium [3J provides annota- 
tions accompanied with evidence codes related to the kind/level of evidence 
available for a given GO annotation |5], such as IE A: Inferred from Electronic 
Annotation, IMP: Inferred from Mutant Phenotype, RCA: inferred from Re- 
viewed Computational Analysis or IDA: Inferred from Direct Assay. It is 
known that some evidence sources are more reliable than others and this 
knowledge can be used, in a Bayesian sense, as subjective probabilities. 

Once an annotation is considered in a probabilistic framework, it could re- 
flect a dependence on the context. One can consider cases in which F{genej G 
Gt\ disease ) ^ F{genej G Gt), defining context-dependent gene annotations 
derived, for instance, from automatic literature mining [8]. 

Our intention is to complement existing approaches, rather than substi- 
tute them. Toward this aim, we built ProbCD to be as modular as possible 
in order to be incorporated into existent software or pipelines [24j , composed 
of ontology pre-processing [19] or powerful visualization capabilities [20l |23] . 

It is important to note that ProbCD is also applicable to other categorical 
data analysis contexts in which the construction of contingency tables is 
subject to uncertainty, a recurrent theme in science. 
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